The short text has become the prevalent format for information on the Internet in recent decades, especially with the development of online social media, whose millions of users generate a vast number of short messages every day. Although the sophisticated signals delivered by short texts make them a promising source for topic modeling, their extreme sparsity and imbalance bring unprecedented challenges to conventional topic models such as LDA and its variants. Aiming at a simple but general solution for topic modeling in short texts, we present a word co-occurrence network based model named WNTM, which tackles sparsity and imbalance simultaneously. Unlike previous approaches, WNTM models the distribution over topics for each word instead of learning topics for each document, which enhances the semantic density of the data space without introducing much additional time or space complexity. Meanwhile, the rich contextual information preserved in the word-word space also guarantees its sensitivity in identifying rare topics with convincing quality. Furthermore, adopting the same Gibbs sampling procedure as LDA makes WNTM easy to extend to various application scenarios. Extensive validation on both short and normal texts shows that WNTM outperforms baseline methods. Finally, we also demonstrate its potential for precisely discovering newly emerging topics or unexpected events on Weibo at very early stages.
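As a rough illustration of the word-centric view described above, the sketch below builds, for each word, a "pseudo-document" of its co-occurring neighbors, on which a standard topic model could then be run. The function name, the sliding-window co-occurrence scheme, and the `window` parameter are assumptions for illustration; the abstract does not specify how WNTM constructs its word co-occurrence network.

```python
from collections import defaultdict

def build_pseudo_documents(docs, window=2):
    """For each word, collect the words co-occurring with it within a
    sliding window across all documents. Each word's collected neighbors
    form its pseudo-document, moving the model from the sparse
    document-word space into a denser word-word space.

    Note: the sliding-window scheme here is an illustrative assumption,
    not necessarily the exact construction used by WNTM.
    """
    pseudo = defaultdict(list)
    for tokens in docs:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Append every in-window neighbor (excluding the word itself)
            # to this word's pseudo-document.
            for j in range(lo, hi):
                if j != i:
                    pseudo[word].append(tokens[j])
    return dict(pseudo)

# Two short "messages": the word "apple" gathers context from both,
# so its pseudo-document mixes the fruit and the phone senses.
docs = [["apple", "fruit", "juice"],
        ["apple", "phone", "screen"]]
pseudo_docs = build_pseudo_documents(docs, window=2)
```

Running a standard LDA over these pseudo-documents yields a topic distribution per word rather than per document, which is the shift in modeling granularity the abstract attributes to WNTM.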